Abstract

The maximum pseudolikelihood method is among the most important methods for learning
parameters of statistical physics models, such as Ising models. In this paper, we study how
pseudolikelihood can be derived for learning parameters of a mixture of Ising models. The
performance of the proposed approach is demonstrated for Ising and Potts models on both
synthetic and real data.
Inference in models of statistical physics, such as Ising models, is generally challenging
because of the intractable partition function (normalizing constant). Inference can take
the form of obtaining statistical properties of the model, e.g., magnetization (forward
problem), or learning the parameters of the model that are most likely to have generated a
given dataset (inverse problem). Both forms of the problem prove difficult in the presence of
unknown partition functions. Methods based on Markov chain Monte Carlo (MCMC) [1] or
mean field approaches [2] have been widely used for both problems. Maximum pseudolikelihood
(MPL) [3] is another popular method successfully applied to many inverse problem
applications [4, 5]. MPL is a consistent estimator of the parameters, i.e., asymptotically
(as the number of samples goes to infinity) it recovers the true parameters [6]. Therefore,
higher accuracy is expected especially when a large amount of data is available. For example,
in the field of direct coupling analysis, MPL is currently known as the most accurate
method [4, 5].

In this paper, we study the problem of learning parameters of a mixture of Ising models.
The data at hand may have been generated by, or may be best explained by, more than one
Ising model. Learning a single set of parameters (one model) in this case may yield a model
that represents only some, or even none, of the data samples. Below, we propose a method
based on pseudolikelihood to learn parameters of a mixture model. Training a mixture of K
Ising models adds little overhead and is as efficient as training K separate Ising models on
the same data.

A mixture of Ising models is a superposition of K Ising models and its pdf is given by

$$ p(s \mid \pi, \theta) = \sum_{k=1}^{K} \pi_k \, p_k(s \mid \theta_k), \tag{1} $$


where $\pi_k$ are mixing coefficients with $\sum_{k=1}^{K} \pi_k = 1$ and $\theta_k$ are the
parameters of the kth Ising model, which comprise coupling parameters $J_{ij}^k$, external
fields $h_i^k$ and inverse temperature $\beta_k = 1/T_k$. The density of an individual Ising
model is

$$ p_k(s \mid \theta_k) = \frac{\phi_k(s; \theta_k)}{Z_k(\theta_k)} = \frac{\exp\left( \sum_i h_i^k s_i + \beta_k \sum_{i<j} J_{ij}^k s_i s_j \right)}{Z_k(\theta_k)}, \tag{2} $$

where $Z_k(\theta_k)$ is the partition function (normalizing constant), whose computation is
intractable. The observed variables s are binary of dimension N, i.e., $s \in \{-1, +1\}^N$.
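
To make the intractability concrete, the following minimal Python sketch evaluates the unnormalized log density of a single Ising model and computes its log partition function by explicit enumeration. It assumes NumPy and a symmetric coupling matrix with zero diagonal; the function names `log_phi` and `log_partition_bruteforce` are illustrative and do not come from the paper. The enumeration visits all 2^N states, so it is usable only for very small N.

```python
import itertools
import numpy as np

def log_phi(s, h, J, beta):
    """Unnormalized log density: sum_i h_i s_i + beta * sum_{i<j} J_ij s_i s_j."""
    # J is assumed symmetric with zero diagonal, so s @ J @ s counts each pair twice.
    return h @ s + beta * 0.5 * (s @ J @ s)

def log_partition_bruteforce(h, J, beta):
    """Exact log Z by enumerating all 2^N spin configurations (small N only)."""
    N = len(h)
    vals = [log_phi(np.array(s), h, J, beta)
            for s in itertools.product([-1.0, 1.0], repeat=N)]
    # Sum the exponentials in log space for numerical stability.
    return float(np.logaddexp.reduce(vals))
```

With N = 1000, as in the synthetic experiment later in the paper, the same loop would need 2^1000 terms, which is why approximations that avoid $Z_k$ are needed.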

In the inverse problem for a mixture of Ising models, the goal is to learn the parameters of
the model $\{\pi_k, \theta_k\}_{k=1}^{K}$ from a dataset $S = \{s_b\}_{b=1}^{B}$ with B samples.
When the $Z_k(\theta_k)$ are analytically available, the standard approach to the inverse
problem is maximum likelihood estimation

$$ \{\pi_k^*, \theta_k^*\}_{k=1}^{K} = \arg\max_{\{\pi_k, \theta_k\}_{k=1}^{K}} \log p(S \mid \pi, \theta) = \arg\max_{\{\pi_k, \theta_k\}_{k=1}^{K}} \sum_{b=1}^{B} \log \sum_{k=1}^{K} \pi_k \, p_k(s_b \mid \theta_k). \tag{3} $$

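If we maximize $\log p(S \mid \pi, \theta)$ w.r.t. the mixing coefficients $\pi_k$ (using a
Lagrange multiplier for $\sum_k \pi_k = 1$), we obtain

$$ \pi_k = \frac{1}{B} \sum_{b=1}^{B} \frac{\pi_k \, p_k(s_b \mid \theta_k)}{\sum_{j=1}^{K} \pi_j \, p_j(s_b \mid \theta_j)} = \frac{1}{B} \sum_{b=1}^{B} \gamma_{bk}, \tag{4} $$

where we call the summand the "responsibility" of mixture component k for sample b and
denote it by $\gamma_{bk}$. Of course, $\gamma_{bk}$ is not analytically available due to the
unknown normalizing constants $Z_k$.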

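$\gamma_{bk}$ can, however, be approximated by approximating $Z_j$ using MCMC methods or mean
field lower bounds. Then, the gradient of the log likelihood w.r.t. $\theta_k$ is written

$$ \frac{\partial \log p(S \mid \pi, \theta)}{\partial \theta_k} = \sum_{b=1}^{B} \gamma_{bk} \left( \frac{\partial \log \phi_k(s_b; \theta_k)}{\partial \theta_k} - \frac{\partial \log Z_k(\theta_k)}{\partial \theta_k} \right). \tag{5} $$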


This can be evaluated by again using MCMC estimates of
$\partial \log Z_k(\theta_k) / \partial \theta_k$. Moreover, it hints that other standard
methods for individual Ising models, like pseudolikelihood or mean field equations, can be
adapted to optimize the $\theta_k$. Below, we derive pseudolikelihood for the optimization of
a mixture of Ising models (MoI).

We introduce a latent variable $y_b$ for each data sample, which has a multinomial
distribution with one trial, i.e., $y_{bk} \in \{0, 1\}$, $\sum_k y_{bk} = 1$,
$p(y_{bk} = 1) = \pi_k$, $p(y_b) = \prod_k \pi_k^{y_{bk}}$. $y_b$ determines which mixture
component the data point $s_b$ belongs to.

Each mixture component is given by an Ising model based on the following conditional
probability

$$ p(s_b \mid y_{bk} = 1) = p_k(s_b \mid \theta_k), \tag{6} $$

which in turn leads to

$$ p(s_b \mid y_b, \theta) = \prod_k p_k(s_b \mid \theta_k)^{y_{bk}}. \tag{7} $$

The marginal of this model is the same as in (1) and the posterior is
$p(y_{bk} = 1 \mid s_b) = \gamma_{bk}$ as in (4).
In order to obtain the full conditional distributions of the observed variables, we make
use of the following equality

$$ \frac{p(s_{b,-n} \mid \theta)}{p(s_b \mid \theta)} = \sum_{y_b} \frac{p(s_{b,-n}, y_b \mid \theta)}{p(s_b \mid \theta)} = \frac{1}{u_{bn}}, \tag{8} $$
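where $s_{b,-n}$ denotes the vector $s_b$ with the nth variable flipped. Since the full
conditional of a binary variable is
$p(s_b \mid \theta) / (p(s_b \mid \theta) + p(s_{b,-n} \mid \theta)) = u_{bn} / (1 + u_{bn})$,
we can write the log pseudolikelihood in terms of $1/u_{bn}$ as follows

$$ \log \mathrm{PL} = \frac{1}{B} \sum_{b} \sum_{n} \log \frac{u_{bn}}{1 + u_{bn}}. \tag{9} $$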
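
For a single Ising model the ratio $1/u_{bn}$ has a closed form, because flipping $s_{bn}$ only changes the terms that contain it: $\log(1/u_{bn}) = -2 s_{bn} (h_n + \beta \sum_{j \neq n} J_{nj} s_{bj})$. The sketch below evaluates the log pseudolikelihood on that basis. It assumes NumPy, spins stored as a (B, N) array of ±1 values, and a symmetric coupling matrix with zero diagonal; the function names are illustrative only.

```python
import numpy as np

def log_flip_ratio(S, h, J, beta):
    """log(1/u_bn) = log phi(s_{b,-n}) - log phi(s_b), for all samples b and sites n."""
    # S: (B, N) array of +/-1; J: (N, N) symmetric with zero diagonal.
    local_field = h + beta * (S @ J)        # h_n + beta * sum_j J_nj s_bj
    return -2.0 * S * local_field

def log_pseudolikelihood(S, h, J, beta):
    """(1/B) * sum_{b,n} log(u_bn / (1 + u_bn)) = -(1/B) * sum_{b,n} log(1 + 1/u_bn)."""
    r = log_flip_ratio(S, h, J, beta)
    # logaddexp(0, r) = log(1 + exp(r)), computed stably.
    return float(-np.logaddexp(0.0, r).sum() / S.shape[0])
```

No partition function appears anywhere in this computation, which is what makes pseudolikelihood attractive for the inverse problem.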

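For the mixture Ising model, (8) is simplified as follows

$$ \frac{1}{u_{bn}} = \sum_{y_b} \prod_k \left[ \frac{\phi_k(s_{b,-n}; \theta_k)}{\phi_k(s_b; \theta_k)} \right]^{y_{bk}} p(y_b \mid s_b, \pi, \theta) \tag{10} $$

$$ \phantom{\frac{1}{u_{bn}}} = \sum_k \frac{\phi_k(s_{b,-n}; \theta_k)}{\phi_k(s_b; \theta_k)} \, p(y_{bk} = 1 \mid s_b, \pi, \theta). \tag{11} $$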
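
The step from (8) to the mixture form can be spelled out for one term of the sum over $y_b$. The following is a sketch using only definitions (2), (6) and (7) and Bayes' rule; the key point is that $Z_k(\theta_k)$ cancels inside each ratio.

```latex
\begin{align*}
\frac{p(s_{b,-n}, y_b \mid \theta)}{p(s_b \mid \theta)}
  &= \frac{p(s_{b,-n} \mid y_b, \theta)}{p(s_b \mid y_b, \theta)}
     \cdot \frac{p(s_b \mid y_b, \theta)\, p(y_b)}{p(s_b \mid \theta)} \\
  &= \prod_k \left[ \frac{\phi_k(s_{b,-n}; \theta_k) / Z_k(\theta_k)}
                         {\phi_k(s_b; \theta_k) / Z_k(\theta_k)} \right]^{y_{bk}}
     p(y_b \mid s_b, \pi, \theta) \\
  &= \prod_k \left[ \frac{\phi_k(s_{b,-n}; \theta_k)}{\phi_k(s_b; \theta_k)} \right]^{y_{bk}}
     p(y_b \mid s_b, \pi, \theta).
\end{align*}
```

Because $y_b$ has exactly one nonzero entry, summing this expression over $y_b$ leaves one term per mixture component, weighted by the posterior $p(y_{bk} = 1 \mid s_b, \pi, \theta)$.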


This suggests that we can build an EM-like iterative algorithm where we first estimate the
responsibilities at iteration t, $p(y_{bk} = 1 \mid s_b, \pi^t, \theta^t)$, then update the
mixing coefficients $\pi$ and the parameters of the individual Ising models $\theta_k$.
Responsibilities can be estimated using any of the approaches mentioned above, such as MCMC,
mean field lower bounds, etc. In this work, we propose to estimate them using
pseudolikelihood, i.e., to use $\mathrm{PL}_k$ instead of $p_k$ in (4). This results in a
very efficient method where both the estimation and the optimization steps are done using
pseudolikelihood.
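
The iteration described above can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not stated in the paper: $\beta$ is absorbed into the couplings, the parameter update is a few steps of plain gradient ascent on the mixture log pseudolikelihood with the responsibilities held fixed, and the step size, iteration counts and all function names are hypothetical choices.

```python
import numpy as np

def flip_log_ratios(S, h, J):
    """r[b, n, k] = log phi_k(s_{b,-n}) - log phi_k(s_b).

    S: (B, N) spins in {-1, +1}; h: (K, N) fields; J: (K, N, N) symmetric
    couplings with zero diagonal (beta assumed absorbed into J).
    """
    local = h[None, :, :] + np.einsum("bj,kjn->bkn", S, J)        # (B, K, N)
    return np.transpose(-2.0 * S[:, None, :] * local, (0, 2, 1))  # (B, N, K)

def responsibilities(S, pi, h, J):
    """Estimation step: gamma[b, k] proportional to pi_k * PL_k(s_b)."""
    log_pl = -np.logaddexp(0.0, flip_log_ratios(S, h, J)).sum(axis=1)  # (B, K)
    a = np.log(np.clip(pi, 1e-300, None))[None, :] + log_pl
    g = np.exp(a - a.max(axis=1, keepdims=True))
    return g / g.sum(axis=1, keepdims=True)

def mixture_log_pl(S, gamma, h, J):
    """(1/B) sum_{b,n} log(u/(1+u)) with 1/u_bn = sum_k gamma_bk exp(r_bnk)."""
    inv_u = (gamma[:, None, :] * np.exp(flip_log_ratios(S, h, J))).sum(axis=2)
    return float(-np.log1p(inv_u).sum() / S.shape[0])

def gradients(S, gamma, h, J):
    """Gradient of mixture_log_pl w.r.t. h and symmetric J, with gamma fixed."""
    B, N = S.shape
    e = gamma[:, None, :] * np.exp(flip_log_ratios(S, h, J))  # (B, N, K)
    w = e / (1.0 + e.sum(axis=2, keepdims=True))
    ws = w * S[:, :, None]                                    # w_bnk * s_bn
    grad_h = (2.0 / B) * ws.sum(axis=0).T                     # (K, N)
    G = (2.0 / B) * np.einsum("bnk,bm->knm", ws, S)           # (K, N, N)
    grad_J = G + np.transpose(G, (0, 2, 1))                   # J_nm and J_mn move together
    grad_J[:, np.arange(N), np.arange(N)] = 0.0               # keep the diagonal at zero
    return grad_h, grad_J

def fit_mixture(S, K, n_iter=50, n_grad=5, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    B, N = S.shape
    h, J = np.zeros((K, N)), np.zeros((K, N, N))
    gamma = rng.dirichlet(np.ones(K), size=B)   # random initial responsibilities
    for _ in range(n_iter):
        pi = gamma.mean(axis=0)                 # mixing coefficients, as in (4)
        for _ in range(n_grad):                 # ascent steps on the mixture log PL
            gh, gJ = gradients(S, gamma, h, J)
            h += lr * gh
            J += lr * gJ
        gamma = responsibilities(S, pi, h, J)   # PL_k in place of p_k
    return pi, h, J, gamma
```

Each pass costs about as much as evaluating K single-model pseudolikelihoods on the data, which is consistent with the efficiency claim in the text. A more careful optimizer (line search, L-BFGS, regularization) could replace the fixed-step ascent.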

We used the infinite range (IR) Ising model to demonstrate the performance of
pseudolikelihood on both single models and mixtures. IR models have the same coupling
parameter between all pairs of variables, i.e., $J_{ij} = J$. We fixed the external fields
to zero, $h_i = 0$ for all variables, and set $\beta = 0.001$. We then generated two datasets
$S_1$ and $S_2$ using J = 1 and J = 3. Pseudolikelihood curves w.r.t. J on these two datasets
are given in the top row of Fig. 1. The J* values which maximize these pseudolikelihood
curves agree with the parameter values used to generate the data. If we concatenate the
datasets, i.e., $S = S_1 + S_2$, and obtain the pseudolikelihood curve on this single dataset
(Fig. 1 bottom row, left plot), we see that the optimal J value is neither of the original
values (1 or 3). In contrast, if we model S with an MoI with two components and plot the
pseudolikelihood surface (Fig. 1 bottom row, right plot), we see that the optimal parameter
values $J_1^* = 1$ and $J_2^* = 3$ coincide with the
[Figure 1: top row, pseudolikelihood curves on the individual datasets; bottom row,
concatenated dataset $S_1 + S_2$ modeled with a single Ising model (left) and with a mixture
of Ising models, K = 2 (right).]

FIG. 1. Infinite range model. N = 1000, B = 100, $\beta$ = 0.001.


original values. Note that the symmetry in the model results in unidentifiability, i.e.,
there are two symmetric peaks.

The proposed MoI learning method can also be extended to mixtures of Potts (MoP)
models. A Potts model is essentially an Ising model whose variables are discrete with more
than two states. Pseudolikelihood for MoP is obtained by treating the flipped state
$s_{b,-n}$ as the collection of all states except the observed one, $s_{bn}$. We tested PL
for MoP in direct coupling analysis for the protein structure prediction problem, where PL
is one of the most successful approaches. We consider two protein families (PF00076 and
PF00105) from the PFAM database, which have the same aligned sequence length (N = 70).
Contact-detection results on the individual datasets with single Potts models are presented
in Fig. 2 (top row). The plots show true positive (TP) rates w.r.t. the number of predicted
contacts, based on pairs with $|i - j| > 4$.

We concatenated these two datasets and tried to learn two different sets of parameters
using PL for MoP with K = 2. With random initialization of the $\gamma_{bk}$ (Fig. 2 middle
row), the TP rates obtained with one of the Potts models (right) are not as good as those of
the single Potts model on PF00105 only. However, the results can be improved substantially
with a more informed initialization. This can be achieved in many ways. Here we considered
choosing K samples (codewords) furthest away from each other from a random subset and
assigning the rest of the samples to the closest codeword through $\gamma_{bk}$. With this
better initialization, the TP rates (Fig. 2 bottom row) are as good as those of the single
Potts models on the original (separate) datasets.
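
The informed initialization can be sketched as follows. The text does not specify the distance measure or how the mutually distant codewords are found, so this sketch assumes Hamming distance and a greedy farthest-point selection; the function name, the `subset_size` parameter and the hard (one-hot) assignment are likewise illustrative assumptions.

```python
import numpy as np

def init_responsibilities(S, K, subset_size=200, seed=0):
    """Pick K mutually distant codewords from a random subset, then assign
    every sample to its nearest codeword.

    S: (B, N) integer array of discrete states.
    Returns gamma of shape (B, K) with one-hot rows, and the K codewords.
    """
    rng = np.random.default_rng(seed)
    B = S.shape[0]
    subset = rng.choice(B, size=min(subset_size, B), replace=False)

    def hamming(X, Y):
        # Number of positions at which two sequences differ, for all pairs.
        return (X[:, None, :] != Y[None, :, :]).sum(axis=2)

    D = hamming(S[subset], S[subset])
    chosen = [int(rng.integers(len(subset)))]       # random first codeword
    while len(chosen) < K:
        # Greedy step: take the candidate farthest from its nearest codeword.
        d_min = D[:, chosen].min(axis=1)
        chosen.append(int(d_min.argmax()))
    codewords = S[subset[chosen]]

    nearest = hamming(S, codewords).argmin(axis=1)  # closest codeword per sample
    gamma = np.zeros((B, K))
    gamma[np.arange(B), nearest] = 1.0
    return gamma, codewords
```

The resulting `gamma` would replace the random starting responsibilities before the first parameter update. Soft assignments based on the distances are an equally plausible variant.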

In summary, we proposed a very simple but efficient way of training mixtures of Ising
models. We showed that when the data includes samples with different characteristics,
a mixture model indeed explains the data better and improves the overall inference. Our
method of choice here was pseudolikelihood, but a similar approach can be pursued using
mean field as well. The method is as efficient as learning K Ising models, without the
overhead that usually comes with handling mixtures. An immediate direction for future work
is model selection, i.e., selecting the optimal K from data as well.
[Figure 2: three rows of two panels each, "TP Rates (Potts 1)" (left) and "TP Rates
(Potts 2)" (right). Rows: Ground Truth, Random Initialization, Better Initialization.]

FIG. 2. PF00076 & PF00105 concatenated. TP rates on individual families (top row): PF00076
(left), PF00105 (right); MoP results with random initialization (middle row) and with a more
informed initialization (bottom row).